A run context manager #484
Conversation
datalad_next/runners/run.py

    timeout: float | None = None,
    terminate_time: int | None = None,
    kill_time: int | None = None,
I think we need a precise definition in the documentation. What is the difference between `timeout` and `terminate_time`? Is `kill_time` on top of `terminate_time`? What is the semantic difference that leads to one being called `timeout` and the others just `*_time`?
I want to add a few more comments after having tried to exercise this through in a concrete use case.
It is unclear where the timeouts apply, and thereby it is unclear how they must be used.
Example: I have a process that downloads some information. This information is reported in a JSON-lines format. The process runs in some kind of batch mode (an input triggers a response, continuing until stdin closes).
Now there are several layers a user of this helper needs to be aware of.
- the layer that feeds the input/request
- the process that downloads and eventually reports a JSON-line
- the protocol that assembles a JSON-line and decodes it, once complete
Let's assume the remote bandwidth is really low (a few bytes per second), and the information that needs to go into a single JSON line is a few kB, so it takes minutes to assemble a response and to have the "outer" protocol generator yield something.
What would the timeouts apply to? I assume time 0 is defined by the input item being fed to the process. Is it the inner layer (i.e., nothing coming down the remote connection) or the outer layer (no JSON-line-based result yielded for X seconds) that would time out?
If you use a generator-protocol, then the timeouts that aid the terminating and killing mechanism are only triggered when the context is left. I extended the documentation to clarify that. So for `stdout_batchcommand` and `annexjson_batchcommand`, the timeout-counting starts when the context is left, which means when the code in the context decides that all operations are done and exits the context.
Timeout-handling for the transmission should be done in the protocol that is used, e.g. `StdOutCaptureGeneratorProtocol`. For example, the protocol might send timeout-messages to the result generator, and the in-context code can decide what to do. In another example, if the protocol raises an exception in its `timeout`-callback, the `batchcommand`-context would be left and the terminating- and killing-timeouts would be started.
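The terminate/kill sequencing described here can be illustrated with a small hedged sketch. This is not the datalad-next implementation: it uses plain `subprocess`, and `run_sketch` with its parameter names is a hypothetical stand-in for the PR's context manager.

```python
# Hedged sketch (not the datalad-next implementation) of how
# `terminate_time` and `kill_time` could act once a context is left.
import subprocess
import sys
from contextlib import contextmanager

@contextmanager
def run_sketch(cmd, terminate_time=None, kill_time=None):
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    try:
        yield proc
    finally:
        # timeout-counting starts here, i.e. when the context is left
        try:
            proc.wait(timeout=terminate_time)
        except subprocess.TimeoutExpired:
            proc.terminate()  # SIGTERM after `terminate_time` seconds
            try:
                proc.wait(timeout=kill_time)
            except subprocess.TimeoutExpired:
                proc.kill()   # SIGKILL after `kill_time` more seconds
                proc.wait()

# a subprocess that would otherwise run "forever" is reliably cleaned up
with run_sketch([sys.executable, '-c', 'import time; time.sleep(600)'],
                terminate_time=1, kill_time=1) as p:
    pass  # leaving the context triggers the terminate/kill sequence
```

Note that with both times left at `None`, the final `wait()` blocks indefinitely, which matches the statement that termination is only guaranteed when `kill_time` (and preferably `terminate_time`) is set.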
I understand the first paragraph and appreciate the clarification.
I have difficulties following the reasoning in the 2nd paragraph. When you say "should be done", are you saying "it is not implemented anywhere"? Do we have any protocol implementation that behaves like this? If not, how do we know that it would work?
Codecov Report

    @@           Coverage Diff            @@
    ##             main     #484    +/-   ##
    ==========================================
    + Coverage   92.41%   92.55%   +0.13%
    ==========================================
      Files         126      138      +12
      Lines        9577     9774     +197
      Branches     1036     1086      +50
    ==========================================
    + Hits         8851     9046     +195
    - Misses        704      712       +8
    + Partials       22       16       -6
I posted another round of comments. I am not sure if that is useful; I will stop until I get feedback. FTR: at 28 commits it has reached my attention span limit. It would be useful for reviewing if the PR were rebased and the commits resembled more reviewable units (new features and their tests, adoption of new features in previously existing code). ATM the commits resemble more the development history of the features themselves.
Will do that as soon as the batch code is settled.
This commit adds a run-context-manager that can execute subprocesses and can guarantee that a subprocess is terminated some time after exiting the context. To have a guarantee that the subprocess is terminated, the user has to set `kill_time` and preferably also `terminate_time`.
This commit uses the run-context-manager and the data processing pipeline with a pattern_processor in `url_operations/ssh.py`.
This commit uses the data-processing-pipeline in `datalad_next/runners/tests/test_run.py`.
This commit addresses comment datalad#484 (comment) about cluttering the namespace. The commit introduces a convention that is currently only used in the protocol class `KillWrapper`. The single keyword argument of the class is prefixed with `dl_`. We should claim the name space `dl_*` for keyword arguments in the protocols that we define. This commit also fixes the return value doc-string description of `_create_kill_wrapper`.
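The `dl_` naming convention described in the commit message could look roughly like this. This is a hypothetical sketch: `KillWrapperSketch` and its arguments are illustrative, not the PR's actual `KillWrapper`.

```python
# Hypothetical sketch of the `dl_` naming convention: framework-defined
# keyword arguments carry the `dl_` prefix, so they can never collide
# with keyword arguments of a user-provided protocol class.
class KillWrapperSketch:
    def __init__(self, dl_kill_time=None, **protocol_kwargs):
        # `dl_kill_time` is claimed by the wrapper itself ...
        self.kill_time = dl_kill_time
        # ... everything else is passed on to the wrapped protocol
        self.protocol_kwargs = protocol_kwargs

w = KillWrapperSketch(dl_kill_time=5, encoding='utf-8')
```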
with run(
        cmd=[
            'git', 'ls-files',
            # we rely on zero-byte splitting below
            '-z',
            # otherwise take whatever is coming in
            *args,
        ],
        protocol_class=StdOutCaptureGeneratorProtocol,
        stdin=DEVNULL,
        cwd=path
) as r:
    # This code uses the data processor chain to process data. This fixes
    # a problem with the previous version of the code, where `decode` was
    # used on every data chunk that was sent to `pipe_data_received`. But
    # data is chunked up randomly and might be split in the middle of a
    # character encoding, leading to weird errors.
    yield from process_from(
        data_source=r,
        processors=[
            decode_processor('utf-8'),
            splitlines_processor(separator='\0', keep_ends=False)
        ]
    )
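For reference, here is a self-contained sketch of how `process_from` with `decode_processor` and `splitlines_processor` could behave. The names mirror the PR, but the implementations below are assumptions modeled on those names, not the PR's code.

```python
# Assumed behavior of the processor chain (illustrative, not the PR's code)
import codecs

def decode_processor(encoding):
    # incremental decoding, so a multi-byte character split across
    # two chunks does not raise UnicodeDecodeError
    dec = codecs.getincrementaldecoder(encoding)()
    return lambda chunk: [dec.decode(chunk)]

def splitlines_processor(separator, keep_ends=False):
    buf = ['']
    def split(text):
        parts = (buf[0] + text).split(separator)
        buf[0] = parts.pop()  # retain the incomplete tail
        return [p + separator for p in parts] if keep_ends else parts
    return split

def process_from(data_source, processors):
    # push every chunk through the processor list and yield results
    # as soon as they are complete ("stream"-oriented)
    for chunk in data_source:
        items = [chunk]
        for processor in processors:
            items = [out for item in items for out in processor(item)]
        yield from items

chunks = iter([b'a\x00b', b'c\x00d\x00'])
lines = list(process_from(chunks, [decode_processor('utf-8'),
                                   splitlines_processor('\x00')]))
# lines == ['a', 'bc', 'd']
```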
Suggested change:

with run(
        cmd=[
            'git', 'ls-files',
            # we rely on zero-byte splitting below
            '-z',
            # otherwise take whatever is coming in
            *args,
        ],
        protocol_class=StdOutCaptureGeneratorProtocol,
        processors=[
            decode_processor('utf-8'),
            splitlines_processor(separator='\0', keep_ends=False)
        ],
        stdin=DEVNULL,
        cwd=path,
) as r:
    yield from r
I brought it up before, but still think this usage should not be more complex than this. If `run()` must not have `processors=`, it could be renamed and the replacement could have it.
Thanks. I think I misunderstood your earlier comments.
I have a commit which defines a `StdOutCapturePipelineGeneratorProtocol`. The protocol takes an optional processor list as a keyword argument and sends the processing-pipeline results to the generator. This is transparent to the run-context-manager and achieves the same terse "ls-files"-implementation, i.e. the code in the context is just:
yield from r
I did not yet put it in because it requires the user of `run` to select the "right" protocol and to provide a processor list in the `protocol_kwargs`. That seemed complex to me. And, in contrast to your suggestion (put the pipeline processing into `run`), which should support any protocol, my approach requires support by the protocol. It would be transparent for the `run`-context manager code, though.
Your suggestion should in principle support any protocol. It would require different treatment of generator- and non-generator protocols if we want to apply pipelines to non-generator protocols as well. The latter would require the results of the non-generator protocols to adhere to a fixed convention. That seems to be the case: all non-generator protocol results are a dictionary with the keys `stdout` and `stderr`. If we want to support processor pipelines on non-generator protocols, the `run`-context manager would have to pass the values of `result_dict['stdout']` and possibly `result_dict['stderr']` through the processing pipeline. This would also require a way to specify processor lists for stdout and stderr. Alternatively, we could limit the `processors`-keyword validity to generator protocols and stdout only.
I am not sure what would be the best approach here. WDYT?
After having worked with adding processing pipelines to non-generator protocols, I would say: "let's not support processing pipelines in non-generator protocols". The non-generator protocols are very much centered around joining individual byte-chunks in their `pipe_data_received`-method and decoding them into strings in their `_prepare_result`-method. Without overriding the `_prepare_result`-method, it is not even possible to receive undecoded byte-content from a subprocess.
To make use of arbitrary pipelines, `pipe_data_received` and `_prepare_result` would have to be overridden. That would introduce another "subtree" of protocol-class definitions. Alternatively, we could introduce a "synchronized" run mode that would collect all results from a generator-protocol-based `run` and return them. In fact, such an approach could make the tree of non-generator-protocol definitions obsolete: any generator-protocol could be used in a non-generator way.
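The "synchronized" run mode floated here could be as small as draining the generator and packaging the result. A sketch under the assumption that a generator yields raw stdout chunks; `iter_stdout` and `run_sync` are hypothetical names, not part of the PR.

```python
# Hedged sketch of a "synchronized" run mode: drain a generator-based
# run eagerly and return a result dict shaped like the results of the
# non-generator protocols.
import subprocess
import sys

def iter_stdout(proc, chunk_size=1024):
    # a minimal "generator protocol": forward raw stdout chunks
    while True:
        chunk = proc.stdout.read(chunk_size)
        if not chunk:
            break
        yield chunk

def run_sync(cmd):
    # collecting the generator output makes the call synchronous
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    stdout = b''.join(iter_stdout(proc))
    return {'stdout': stdout, 'code': proc.wait()}

res = run_sync([sys.executable, '-c', 'import sys; sys.stdout.write("hi")'])
# res == {'stdout': b'hi', 'code': 0}
```

With such a wrapper, any generator-protocol can be used in a non-generator way, which is the consequence named above.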
I brought the processing protocol classes back. Using those classes, the code that executes `git ls-files` looks like this:
with run(
cmd=[
'git', 'ls-files',
# we rely on zero-byte splitting below
'-z',
# otherwise take whatever is coming in
*args,
],
protocol_class=StdOutCaptureProcessingGeneratorProtocol,
stdin=DEVNULL,
cwd=path,
protocol_kwargs=dict(
processors=[
decode_processor('utf-8'),
splitlines_processor(separator='\0', keep_ends=False)
]
)
) as r:
# This code uses the data processor chain to process data. This fixes
# a problem with the previous version of the code, where `decode` was
# used on every data chunk that was sent to `pipe_data_received`. But
# data is chunked up randomly and might be split in the middle of a
# character encoding, leading to weird errors.
yield from r
""" This module contains data processors for the data pipeline processor

The data processors contained here are:

- decode_utf8_processor

"""
None of this documentation is accessible in the rendered docs. This hides the fact that all the `data_processors` modules are documented in the exact same way, making them indistinguishable from each other.
I think the pattern of manually listing the main components of a module in the top-level docstring is not optimal. In the rendered documentation this would be automatic/obvious already. Here it adds content that needs to be kept in sync. I believe an `__all__` declaration would be a more appropriate alternative.
In general I think that the sorting of code in all the modules is artificially creating the need to describe what matters most. The "public" component tends to be last in the file (although it could be first).
[...]
In general I think that the sorting of code in all the modules is artificially creating the need to describe what matters most. The "public" component tends to be last in the file (although it could be first).
An old C/C++-habit that is seemingly difficult to shake. But I will manage :-)
This commit adds pipeline processing generator protocols. The protocols can be initialized with a list of processors that they will apply to the data that they receive. There is currently no equivalent for non-generator protocols. The reason for that is that those protocols expect every data element that they handle to be `bytes`. That does not fit well with processors like `splitlines_processor` or `decode_processor`. We could implement a new non-generator protocol "family" that collects whatever comes from a processor pipeline as the values of the `stdout`- and `stderr`-keys of the result dictionary. But it is currently not clear when and where that would be used.
I want to take one aspect from above and bring it up for a dedicated discussion:
What about "let's not support non-generator protocols" instead? This PR adds 2k+ lines already. This is ~10% of the entire code base. It is a LOT. I get the impression that some of that is just due to "need to support old stuff that happened to be done". But this is not what this extension is about. If we can find a paradigm that can be supported with a lean implementation that does not multiply the code base, I would very much prefer that. I do not mind if that has consequences, as long as they are spelled out. What would happen if only generator-protocols are supported? What would be the performance impact on real use cases? (I am specifically not talking about theory -- I understand that thread-management adds complexity.)
That is a good question. I think it is a very interesting idea (I had a similar hunch when I wrote the developer handbook). From the top of my head, a probably incomplete initial answer:
If we only support generator-protocols, we might get away with exactly one protocol that just forwards all events to a result generator. I think it is conceptually simpler because there are no more "opaque" operations performed in protocol classes. The drawback is that it would require new abstractions to write nice, correct, and re-usable code that processes the events in the [...] The sweet spot is probably somewhere in the middle.
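The "exactly one protocol" idea above could be as small as a class whose callbacks merely forward event tuples, leaving all interpretation to the code consuming the result generator. A hedged sketch; `ForwardingProtocol` and the event shapes are hypothetical.

```python
# Hedged sketch of a single forwarding protocol: every callback
# enqueues an event tuple, nothing is processed in the protocol itself
from queue import Queue

class ForwardingProtocol:
    def __init__(self, queue):
        self.queue = queue

    def pipe_data_received(self, fd, data):
        self.queue.put(('data', fd, data))

    def timeout(self, fd):
        self.queue.put(('timeout', fd))

    def process_exited(self, code):
        self.queue.put(('exited', code))

q = Queue()
proto = ForwardingProtocol(q)
proto.pipe_data_received(1, b'hello')
proto.process_exited(0)
events = [q.get_nowait(), q.get_nowait()]
# events == [('data', 1, b'hello'), ('exited', 0)]
```

The in-context code would then pattern-match on the event tags, which is where the new abstractions mentioned above would come in.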
This commit addresses comment: datalad#484 (comment). It removes unrendered doc strings and reorders data processor code so that the data processor is on top of the source files. In addition it adds `__all__` variables to limit imports to the objects that constitute the user interface. The commit also adds doc-string rendering for the module `datalad_next.runners.data_processors`. This includes markup fixes.
Registering #497 (comment) here. I did not expect that, but I have to admit that I lost track a bit.
This PR is superseded by PR #516. I will close it for now.
This PR adds a run-context manager. It is currently a draft PR, because there are open TODOs, but I wanted to start discussions early and would like to highlight that any input on this PR is appreciated.
What's in it
The run-context manager starts a subprocess and returns either a result structure or a generator to the code executed in the context.
If the protocol that is used to create the context does not inherit `GeneratorProtocolMixIn`, the context manager will execute the specified subprocess and, when the subprocess exits, return the result of `protocol._prepare_result()` in the `as`-variable.
If the protocol that is used to create the context does inherit `GeneratorProtocolMixIn`, the context manager will start the execution of the process and return the resulting generator in the `as`-variable. The generator can be used to fetch output from the subprocess and to trigger protocol callbacks, e.g. `pipe_data_received`, `timeout`, etc. If the code exits the context, the context manager will execute `tuple()` on the generator to ensure that the subprocess exit is properly handled, i.e. protocol callbacks are called and the return code of the subprocess is fetched.
The context handler supports the specification of terminate- and/or kill-times. If those are specified, the context manager might send a terminate and/or a kill signal to the subprocess. The following happens if terminate- and/or kill-time are specified: [...]
The PR comes with a number of tests that demonstrate the behavior.
Data processors
The PR includes an implementation of a generator that can be configured to read data from a base generator, perform different steps of data processing, and yield the result of the last processing step to the caller. There are currently three data processors: `decode_processor`, `splitlines_processor`, and `pattern_processor`.
Which processors are used is defined by putting them into a list. For example, the list `[decode_processor, splitlines_processor]` specifies that the content that is received from the underlying generator should first be decoded and then split into lines. The processing is "stream"-oriented, which means that data is returned as soon as possible, e.g. each individual line is returned to the caller as soon as it was read.
Implementation remark
The run-context manager creates a subclass of the provided protocol class in order to amend the provided protocol class with the kill-logic. It uses this subclass to instantiate a runner. This construction allows the user to use any protocol class without the need to define yet another class, while it ensures that the class that is used by the runner supports the kill-logic.
TODO